Abstract

Enabling users to interactively switch the viewpoint from which they are watching a static scene is an interesting functionality in
3D transmission systems. While it opens exciting perspectives towards rich multimedia applications, it requires the development of 
specific representation and coding techniques in order to solve the new challenges imposed by interactive navigation. In particular, 
the encoder must prepare a compressed media stream that is flexible enough to enable a free selection of the multiview navigation 
path by the different media clients. Interactivity thus brings new design constraints: the encoder is unaware of the exact decoding 
process, and on the other hand the decoder has to reconstruct information from incomplete subsets of data since the system can 
obviously not transmit images for all possible viewpoints. Furthermore, the system has to satisfy some bandwidth and storage 
constraints that are in general inherently conflicting, e.g., a description that enables low transmission rate may lead to high storage 
cost. In this paper, we propose a novel multiview data representation method that tries to satisfy the above constraints. We partition 
the domain of multiview navigation into segments, which are described with a reference image (color and depth data) and some 
auxiliary information. The auxiliary information enables the client to recreate any viewpoint in the navigation segment by view 
synthesis. The decoder is then able to navigate independently in the segment without further data request to the server; it requests 
additional data only when it moves to a different sub-region. We discuss the benefits of this novel representation and further 
propose a method to optimize the partitioning of the navigation domain into independent segments, under bandwidth and storage 
constraints. Experimental results confirm the potential of the proposed representation. In particular, our system achieves
compression performance similar to that of classical inter-view coding, while providing the high level of flexibility that is required for interactive
streaming. Our new system represents a promising alternative solution for 3D data representation in novel interactive multimedia 
services. 

Index Terms 

Multiview video coding, interactivity, data representation, navigation domain 

I. Introduction 

In an increasing number of multimedia applications, three-dimensional data can be used to provide interactivity
at the receiver side, where users can freely change viewpoints on their 2D displays. This enables viewers to freely adapt their
viewpoint to the scene content and provides a 3D sensation during view navigation, essentially due to the look-around effect
[1], [2]. The design of such an interactive system requires the development of new techniques in the different blocks of the
3D processing pipeline, namely acquisition [3], representation [4], coding [5], transmission [6] and rendering [7]. Solutions
that are classically used for multiview video transmission [8] are no longer effective since they consider the transmission of an
entire set of frames, which is not ideal for interactive systems with delay and bandwidth constraints. Interactive schemes would
ideally transmit the requested images only. Hence, the challenge is to design a system that exploits the correlation between
multiview images in order to satisfy the different users' requests without precise knowledge of data availability at the decoder.
With the classical compression techniques based on inter-image prediction with motion/disparity estimation, the problem can 
be solved with two naive approaches. Firstly, if the server is able to store all the possible encoding prediction paths between 
the achievable views, the user can receive only the required frames (with a prediction corresponding to its navigation path) 
at low bitrates. Alternatively, it is also possible to consider a real-time encoding (and thus real-time inter-image prediction) 
[9] depending on the user position. However, these two solutions do not scale with the number of users and rely on infinite
storage and delay-free communication assumptions, which are not realistic in practical settings. The challenge for realizing an
interactive multiview system for static 3D scene transmission is thus twofold: i) decrease the storage size without penalizing
the transmission rate too much and ii) anticipate user requests and prepare data accordingly. This has to be done by considering
the complete system, from the data capture to the view rendering blocks, including representation and coding strategies.

A few solutions in the literature try to optimize the trade-off between storage, bandwidth and interactive experience. A first 
category of methods optimizes switching between captured views only. In other words, these methods adapt the structure of the inter-view
predictions in order to provide interactivity at a moderate cost. Some of them are inspired by the techniques that
have been developed to provide interactivity for monoview video. For example, the concept of SP/SI frames [10] is adapted in
[11] for view switching. Other works propose to modify the prediction structure between the frames [12], [13] by predicting
the user position with the help of Kalman filtering. The authors in [14] propose to store multiple encodings on the server and
to adapt the transmission to the user position. This is however very costly in terms of storage. In [15], [16], the multiview
sequence is encoded with a GoGOP structure, which corresponds to a set of GOPs (Groups of Pictures). The limitation of
such methods is a fixed encoding structure that cannot be easily adapted to different configurations. In [17], the problem is
formulated so that the proposed prediction structure reaches an optimal trade-off between storage and expected streaming rate.
The possible types of frames are intra frames and predicted frames (with the storage of different motion vectors and residuals).
Some other techniques [18], [19], [20] rely on the idea of combining distributed source coding and inter-view prediction for
effective multiview switching. They propose an extension of the view switching methods of the monoview framework [21].
Unfortunately, all of these solutions remain limited since they restrict the navigation to a small subset of views (the captured
ones, generally not numerous), which results in an abrupt, unnatural view-switching experience. Moreover, they cannot directly
be extended to a system that provides smooth navigation all over the scene at the receiver (with a higher number of achievable
viewpoints). 

A second category of methods tries to offer free viewpoint navigation by considering a higher number of achievable views
at the receiver. This could be obtained by simply increasing the number of captured views, which is not feasible in practice and
not efficient in terms of redundancy in the representation. Some solutions [22] extend the previously mentioned techniques
by introducing virtual view synthesis at the decoder. However, they remain inefficient since the obtained virtual view quality
is low and the user navigation capacity is still limited. Other methods introduce high redundancy in the scene description by
using a light field representation [23], [24]. They sample the navigation domain very finely, concatenate all the images and
finally model the light rays of the scene. The view rendering performed at the receiver side with such light fields has a better 
quality and enables quite smooth navigations. However, the navigation is constrained and the data representation does not 
achieve good compression performance. In general, all the solutions that multiply the number of possible views have inherent 
redundancies in the representation, which results in an inefficient streaming system. 

It is important to note that none of these methods uses an end-to-end system design approach. For example, while
optimizing the coding techniques, almost none of the above works consider the constraints of the data rendering step. This results
in data blocks with strong dependencies, which are not necessarily optimized for interactive navigation. In this work, we build
on [25] and propose a radically new design that is supported by a flexible data representation method for static 3D scenes.
It also includes new methods for data compression and for view rendering at the receiver. The proposed solution achieves 
a high quality free-viewpoint navigation experience and limits data redundancies in the system. Instead of optimizing data 
representation for a small set of predefined viewpoints, we rather consider that free viewpoint navigation is described by a 
navigation domain with all the possible virtual camera locations. The navigation domain (ND) is divided into sub-domains called 
navigation segments, which are transmitted to the decoder upon request. Upon reception of data from a navigation segment, the 
decoder can independently create any virtual view in this sub-domain without further request to the server. It provides a high 
navigation capability at the receiver. But it also implies a complete change in the data representation in order to limit storage 
and bandwidth costs. Each navigation segment is then represented with a reference frame and some auxiliary information. 
The auxiliary information carries in a compact form the innovation inherent to new viewpoints and makes it possible to synthesize any
view in the navigation segment with approximately the same quality. Then, we propose to optimize the partitioning of the
navigation domain under rate and storage constraints. We finally illustrate the performance of our system on several datasets 
with different configurations. We observe that the proposed data representation achieves good compression performance with 
high flexibility for interactive user navigation. This new method offers a promising alternative solution for the design of 3D 
systems with new modes of interactions and rich quality of experience. 

The paper is organized as follows. In Sec. II, we introduce the principles of our system. Then, in Sec. III, we present our
solution to optimize the partitioning of the navigation domain. Finally, in Sec. IV, we present simulation results that
validate the proposed approach.



II. Interactive multiview navigation 

A. System overview 

In an interactive system, the user is able to freely navigate between a large set of viewpoints, in order to observe a static 
scene from different virtual camera positions. It generally means that the user has to communicate with a server and to request 
data that permits reconstruction and rendering of the desired virtual views on a 2D display. Let us consider a navigation domain 
constituted by a set of viewpoints. Our system relies on a novel data representation method that goes beyond the common
image-based representation and rather considers the global navigation domain as a union of different navigation segments. So let
us consider that the navigation domain is divided into $N$ navigation segments, each of them being coded in a single data unit $D_i$ (see
Fig. 1). We will see later that $D_i$ is globally represented in the form of one reference image and some auxiliary information.
We further consider that a server stores all the $D_i$, with a resulting storage cost of $\sum_{i=1}^{N} |D_i|$ ($|D_i|$ being the size in bits
of the data $D_i$). A user navigating between the viewpoints regularly transmits its position to the server. If a user in a
navigation segment $i$ comes close to a border with another navigation segment $j$, the server transmits the data $D_j$ to the user,
which increments the reception rate cost by $|D_j|$. We see that, if the partitioning of the navigation domain is dense (i.e., $N$ is
high), on the one hand, the segment size $|D_i|$ decreases, but on the other hand, the user sends requests to the server more often. On
the contrary, if $N$ is low, the number of requests to the server decreases, but the user has to receive heavy $D_i$. We clearly see
that $N$ should be determined carefully, taking into account the bandwidth and storage constraints.
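
The trade-off described above can be illustrated with a small simulation. The following Python sketch is only an illustration under simplifying assumptions that are not part of the proposed system: the navigation domain is the interval [0, 1) split into equal segments, the user performs a random walk, a segment is downloaded each time the user enters it (no caching and no anticipation near borders), and the size model (`total_bits`, `overhead_bits`) is hypothetical.

```python
import random

def simulate_navigation(num_segments, total_bits=1_000_000, overhead_bits=20_000,
                        num_steps=10_000, step=0.01, seed=0):
    """Random walk over [0, 1) split into equal segments.

    Returns (storage_bits, received_bits, num_requests).
    """
    rng = random.Random(seed)
    # Hypothetical size model: the content is shared out evenly between the
    # segments, plus a fixed per-segment overhead (e.g. one reference image).
    segment_bits = total_bits / num_segments + overhead_bits
    storage_bits = num_segments * segment_bits

    position = 0.5
    current = int(position * num_segments)
    received_bits = segment_bits  # the first segment must be downloaded
    num_requests = 1
    for _ in range(num_steps):
        # Move one navigation step left or right, staying inside the domain.
        position = min(max(position + rng.choice((-step, step)), 0.0), 1.0 - 1e-9)
        segment = int(position * num_segments)
        if segment != current:
            # Simplification: every segment change triggers a new download.
            received_bits += segment_bits
            num_requests += 1
            current = segment
    return storage_bits, received_bits, num_requests

for n in (2, 10, 50):
    storage, received, requests = simulate_navigation(n)
    print(n, round(storage), round(received), requests)
```

Under these assumptions, a dense partitioning yields lighter segments but many more requests, while a coarse partitioning yields few requests but heavy segments, which is the behavior that motivates a careful choice of $N$.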

We notice that the communication between server and user is quite simple. It has to deal with data transmission only close
to the borders of the navigation segments. Hence, this can easily be generalized to a multi-user scenario. However, if
the number of users becomes very high, one can consider a system with multiple servers, but this is out of the scope of the
paper.




Fig. 1. Navigation domain is first partitioned. Each navigation segment is encoded and stored on a server. Users interact with the server to request the 
navigation segments needed for the navigation. 



B. Navigation with 2D images 

We now provide a formal description of the interactive multiview framework that we propose in this paper. We consider a
system that captures and transmits data of a 3D scene $S$ to clients that can reconstruct 2D images of the scene for navigation,
i.e., view-switching along certain directions. The scene $S$ is described as a countable set of random variables $s_i$ taking values
in $C^3$, where $C$ is the set of possible color values (e.g., $[0,255]^3$).$^1$ Each of these random variables can be seen as a voxel
in the 3D space [26]. The decoder reconstructs observations of the scene at different viewpoints. These observations are 2D
images that correspond to finite sets of $N$ random variables $x_i$ taking their values in $C^3$. The observation of the 3D scene
from one particular viewpoint gives an image $X$ that is obtained with a projection function associated with $X$. Since the depth
information is known, we rather define the back projection function, which associates a pixel of an image with a 3D point in the
scene:

$$f_X : X \to S, \qquad x \mapsto s = f_X(x).$$

This projection function depends on the distance between objects and camera plane (i.e., depth) and on the extrinsic and intrinsic
parameters of the camera [27], [28], [29]. In this work, we assume that each pixel in $X$ maps to a single voxel in 3D space
$S$, and reciprocally, each voxel in $S$ maps to at most one pixel in $X$ (in other words, $f_X$ is a bijection from $X$ onto $f_X(X) \subset S$).
This assumption is correct as long as the 3D scene is sampled at a sufficiently high resolution, which is the scenario that we
consider in the following. Not all the elements of $S$ can be seen from one viewpoint. We call $S_X = f_X(X)$ the finite subset
of $S$ whose elements are mapped to elements of $X$. This is the set of elements of $S$ that are visible in $X$. It naturally depends
on the viewpoint. Our objective is to deliver enough information to the decoder, such that it can reconstruct different images
of the scene. At the same time, the images from different viewpoints have a lot of redundancy. Ideally, to reconstruct an image
$X'$ knowing the image $X$, the decoder should be sent only the non-redundant information. We define it as the innovation of $X$
with respect to $X'$: $I_{X,X'} = S_X \setminus S_{X'}$ (see Fig. 2). This innovation is due to two classical causes in view switching. First,
disocclusions represent the most complex source of innovation. They are due to pixels that are hidden by a foreground object
or that are out of the camera range in the first view and become visible in the second view. The disocclusions are generally not
considered at the encoder in the literature. Existing schemes consider that they can be approximated by inpainting [30], [31]
or partially recovered via projection from another view [27]. Although the performance of inpainting techniques is improving,
there still exists a problem with new objects or with frame consistency (especially when neighboring frames are not available
in interactive systems). These two issues should be considered at the encoder and data to resolve disocclusions should also be
sent to the decoder. We propose below a new data representation method for interactive multiview navigation based on these
two important ideas.

1 In this work we make the Lambertian hypothesis, i.e., we assume that a voxel reflects the same color even when viewed from different viewpoints.

Second, innovation can also be generated by some new elements that appear due to a variation in the object resolution, i.e.,
when an object is growing from one viewpoint to another one. In other words, two consecutive pixels representing the same
object in $X$ could map to two non-consecutive ones in $X'$ (even if they still describe the same object), leaving the pixel(s)
in between unmapped. This is due to the bijection assumption introduced above. However, we have chosen to
restrict our study to the handling of disocclusions and we assume that these resolution-variation missing pixels are recovered
by a simple interpolation of the neighboring available pixels (of the same object). This assumption remains reasonable if we
consider a navigation without large forward displacements. This is actually what is classically done in the view synthesis papers
[27], [28]. Therefore, in the experiments, we will only consider navigation trajectories that remain at a similar distance from
the scene.





Fig. 2. Illustration of visibility of scene elements in the images X and X' . The innovation of X' with respect to X is represented with the black line. 



C. Navigation domains 

We define the new concept of navigation domain as a contiguous region that gathers different viewpoints of the 3D scene S, 
with each of these viewpoints being available to the users to be reconstructed as a 2D image (see Fig. 3). This is an alternative
to the classical image-based representation used in the literature, where a scene is represented by a set of captured views [32].
In our framework, the concept of captured camera or virtual view does not exist anymore, in the sense that all the images of
the navigation domain are equivalent. We denote by $c(X) \in \mathbb{R}^p$ the camera parameter vector associated with the image $X$:

$$c(X) = c_X = [\,\underbrace{t_x \;\; t_y \;\; t_z}_{\text{translation}} \;\; \underbrace{\theta_x \;\; \theta_y \;\; \theta_z}_{\text{rotation}}\,]^T.$$

From these parameters, we define the navigation domain as a continuous and bounded domain $\mathcal{C} \subset \mathbb{R}^p$. We associate with $\mathcal{C}$ the
dual image navigation domain: $\mathcal{X} = \{X \mid c_X \in \mathcal{C}\}$. In the following, a navigation domain (ND) refers to both the set $\mathcal{C}$ and its
dual definition.





Fig. 3. The navigation domain can be 1D or 2D, and is defined by the set of camera parameters $\mathcal{C}$.



The new concept of navigation domains permits us to have a general formulation of the view switching problem. Naturally,
it also leads to novel data representation methods. The main idea of our novel approach is first to divide the navigation domain
into non-overlapping parts $\mathcal{X}_i$, called navigation segments. In other words, we have $\mathcal{X} = \bigcup_i \mathcal{X}_i$ with $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for all
$i \neq j$. Then, we represent all the views in one segment with one signal, which is used at the decoder for user view switching.

Each navigation segment is first described by one reference image, called $Y$. This image is important as it is used for the
baseline reconstruction of all images in the segment. We thus denote the navigation segment as $\mathcal{X}(Y)$, which represents the set
of images that are reconstructed from a reference $Y$ at the decoder. The reference image completely determines the rest of the
navigation segment under some consideration about the geometry of the scene and the camera positions, as explained later.
We therefore use the simplified notation $\mathcal{X}(Y)$ for uniquely describing a navigation segment. The part of the scene visible
from the reference image $Y$ is called $S_Y = f_Y(Y)$ (illustrated in plain red lines in Fig. 5). At the decoder, a user-selected
image $X$ is reconstructed using depth-image-based rendering techniques (DIBR [33]) that project the frame $Y$ onto $X$, i.e.,
the decoder builds $f_X^{-1}(S_Y)$. The decoder is thus missing the elements of information in $X \setminus f_X^{-1}(S_Y)$ for each view $X$ of
the navigation segment. Some of these missing elements in different $X$'s map to the same voxel. This is why we merge the
different innovations and we define the global segment innovation as

$$\Phi = \bigcup_{X \in \mathcal{X}(Y)} S_X \setminus S_Y. \tag{1}$$

It corresponds to the global information that is missing in $Y$ to recover the whole navigation segment. It is represented in blue
dashed lines in Fig. 4. The objective is to design an auxiliary information, for transmitting this set $\Phi$ to the decoder, that is
very light and takes the coded form $\varphi = h(\Phi)$.

[Fig. 4 labels: objects of the scene; part of the scene visible from the reference camera, $S_Y = f_Y(Y)$; reference camera; segment innovation $\Phi = \bigcup_{X \in \mathcal{X}(Y)} S_X \setminus S_Y$; navigation segment $\mathcal{X}(Y)$; reference image.]

Fig. 4. Illustration of the concept of navigation segment for a simple scene (represented from top) with one background (vertical plane) and two foreground 
objects (vertical rectangles). 

Equipped with our new data representation method, we can finally describe our communication system in detail. We assume
that a server stores the different navigation segments that compose the whole navigation domain. This storage has a general cost
$\Gamma$. At the receiver, a user navigates among the views and chooses to build a 2D image $X$ at a viewpoint described by parameters
$c_X \in \mathcal{C}$. The only constraint in the navigation is that the user cannot choose the viewpoint arbitrarily; the user has to switch smoothly
to the neighboring images. We define a distance $\delta$ between two camera parameter vectors $c$ and $c'$ as $\delta : (c, c') \mapsto \delta(c, c')$.
We can consider different distances depending on whether we want to emphasize
rotation or translation. We also denote by $\delta : (X, X') \mapsto \delta(c_X, c_{X'})$ the dual distance between two images $X$ and $X'$. Since $\mathcal{C}$ is
a continuous set, we define $\Delta$ as the navigation step, which corresponds to the distance between two different images chosen
at consecutive instants. We assume that the user can send its position in the navigation domain every $N_T$ frames.$^2$ Once the
user sends its position, the server transmits all the navigation segments that the user might need in the next $N_T$ instants.
We define the navigation ball as the set of achievable viewpoints in the next $N_T$ instants from the viewpoint $X$ as:

$$B(X, N_T \Delta) = \{X' \in \mathcal{X} \mid \delta(X, X') < N_T \Delta\}. \tag{2}$$

In other words, the server sends all $\mathcal{X}(Y_i)$ that verify $\mathcal{X}(Y_i) \cap B(X, N_T \Delta) \neq \emptyset$. Finally, the user navigation depends on the
a priori view popularity distribution $p(X)$ ($X \in \mathcal{X}$), which corresponds to a dense probability distribution over the views. It
describes the relative popularity of the viewpoints (we have $\int_{X \in \mathcal{X}} p(X)\,dX = 1$), as in practice not all viewpoints have the same
probability of being reconstructed at the decoder.
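
The segment selection rule based on the navigation ball of Eq. (2) can be sketched as follows. This is a minimal Python illustration, assuming that each segment is approximated by a finite list of sampled camera parameter vectors and that the distance between cameras is Euclidean; both choices, as well as the names `camera_distance` and `segments_to_send`, are assumptions made for the example.

```python
import math

def camera_distance(c1, c2):
    """Euclidean distance between two camera parameter vectors.

    This is only one possible choice of distance; rotation and translation
    could be weighted differently.
    """
    return math.dist(c1, c2)

def segments_to_send(user_camera, segments, n_t, step):
    """Indices of the segments that intersect the navigation ball.

    `segments` is a list of lists of camera parameter vectors, i.e. a sampled
    stand-in for the continuous navigation segments. The ball radius is
    N_T times the navigation step.
    """
    radius = n_t * step
    return [index for index, views in enumerate(segments)
            if any(camera_distance(user_camera, c) < radius for c in views)]

# Toy 1D navigation domain: cameras only translate along one axis.
segments = [
    [(0.0,), (0.1,), (0.2,)],
    [(0.3,), (0.4,), (0.5,)],
    [(0.6,), (0.7,), (0.8,)],
]
print(segments_to_send((0.25,), segments, n_t=2, step=0.05))
```

In this toy example the user sits close to the border between the first two segments, so both are selected for transmission while the third one is not.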

2 If $f$ is the frame rate, $N_T$ can be expressed in seconds by dividing the value expressed in number of frames by $f$.

D. Data coding

We now describe the data coding method that is used in this paper. This method is not fully optimized in terms of compression
and is certainly not exclusive; it however nicely fits the design choices described above. Recall that each navigation segment
is composed of one reference image $Y$ and some auxiliary information $\varphi = h(\Phi)$, where $\Phi$ is the segment innovation. First,
the images $Y$ (color and depth data) are coded and stored using classical intra frame codecs such as H.264/AVC Intra [8]. We
use such reference images to generate all the other views of the navigation segment $\mathcal{X}(Y)$ via view synthesis. As explained
before, the set of frames $X \in \mathcal{X}(Y) \setminus \{Y\}$ contains a certain innovation $\Phi$ that represents the global novelty of the views in
the navigation segment with respect to $Y$. In practice, we estimate this set as follows (it is also illustrated in Fig. 5). We first
project the image $Y$ in the 3D scene using depth information (in other words, we compute $S_Y$). Then, we project every frame
$X$ from the segment $\mathcal{X}(Y)$ in the 3D scene using depth information (i.e., we compute $S_X$). In our representation, each pixel
is associated with a voxel in the 3D space and some voxels are shared by two images. In practice, $\Phi$ is the union of voxels
visible in views in $\mathcal{X}(Y)$ but not visible in $Y$. In order to avoid redundancies, the voxels shared by different views in $\mathcal{X}(Y)$
are only represented once in $\Phi$. In the following, we will use the concept of size of $\Phi$, which simply corresponds to the number
of voxels in the set, denoted as $|\Phi|$. We will see that this size has a strong impact on the rate of the auxiliary information,
denoted by $|\varphi|$.
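
The construction of the segment innovation described above amounts to simple set operations once every image has been projected to the 3D scene. The following Python sketch illustrates this, assuming that voxels are identified by integer coordinate tuples and that the projections have already been computed; the function name and the toy voxel sets are hypothetical.

```python
def segment_innovation(reference_voxels, view_voxels):
    """Union over the views X of (S_X minus S_Y).

    Using a set guarantees that a voxel shared by several views of the
    segment is represented only once.
    """
    reference = set(reference_voxels)
    innovation = set()
    for s_x in view_voxels:
        innovation |= set(s_x) - reference
    return innovation

# Toy example: voxels are (x, y, z) tuples on a single depth plane.
s_y = {(0, 0, 5), (1, 0, 5), (2, 0, 5)}    # visible from the reference image
s_x1 = {(1, 0, 5), (2, 0, 5), (3, 0, 5)}   # visible from a first view
s_x2 = {(2, 0, 5), (3, 0, 5), (4, 0, 5)}   # visible from a second view

phi = segment_innovation(s_y, [s_x1, s_x2])
print(sorted(phi), len(phi))
```

Here the voxel `(3, 0, 5)` is visible from both views but appears only once in the innovation, so the size of the innovation is 2.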



Fig. 5. Example of $\Phi$ construction, when images $Y$, $X_1$ and $X_2$ are projected to the 3D scene. In that example, $\Phi$ is made of 9 voxels, thus $|\Phi| = 9$.

We still have to encode the auxiliary information to reduce its size. The encoding function (i.e., the function $\varphi = h(\Phi)$)
consists in building a quantized version of DCT blocks from the auxiliary information image, which can gather the segment
innovation when cameras are aligned. The segment innovation image is then divided into small pixel blocks that are DCT
transformed and quantized. The bitstream is then encoded with a classical arithmetic coder. If the navigation domain is more
complex, this approach can be extended to the layered depth image (LDI [34]) format. In that case, auxiliary information in
each layer can be DCT transformed and quantized.
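
The block transform and quantization step can be sketched as follows. This is a minimal pure-Python illustration, assuming an orthonormal DCT-II and a uniform quantizer with a single step size `q_step`; the block size, the quantizer design and the function names are assumptions for the example, and the arithmetic coding stage is not shown.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix, returned as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                     for m in range(n)])
    return rows

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def encode_block(block, q_step):
    """2D DCT of a square pixel block followed by uniform quantization."""
    d = dct_matrix(len(block))
    coeffs = matmul(matmul(d, block), transpose(d))
    return [[round(c / q_step) for c in row] for row in coeffs]

def decode_block(levels, q_step):
    """Dequantization followed by the inverse 2D DCT."""
    d = dct_matrix(len(levels))
    coeffs = [[level * q_step for level in row] for row in levels]
    return matmul(matmul(transpose(d), coeffs), d)

# A flat 8x8 block only produces a DC coefficient.
flat = [[128] * 8 for _ in range(8)]
levels = encode_block(flat, q_step=16.0)
print(levels[0][0], sum(abs(v) for row in levels for v in row))
```

A larger `q_step` gives fewer non-zero levels to code, at the price of a larger reconstruction error, which is the usual rate-distortion trade-off of transform coding.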

At the decoder, we exploit the auxiliary information in a reconstruction strategy that is based on Criminisi's inpainting
algorithm [30]. It is made of two steps. During the first one, the algorithm chooses the missing image patch that has the highest
priority based on image gradient considerations. The second step fills the missing information by using a similar patch from
the reconstructed parts of the image. We modify Criminisi's inpainting algorithm by introducing in this second step a distance
between the candidate patch and the auxiliary information. The hole filling technique thus chooses a patch that corresponds
to the auxiliary information $h(\Phi)$. This auxiliary information is deduced from the $\varphi$ signal and projected to the patch position
using depth information. Finally, it is important to note that the design of the auxiliary information coding technique does not
depend on the decoder. Also, the reconstruction technique is independent of the type of hash information that is transmitted.
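
The modified patch selection step can be illustrated with the following Python sketch. It is not the exact criterion used in the experiments: it assumes flattened patches, a sum of squared differences as distance, and a hypothetical `weight` that balances the agreement with the known pixels against the agreement with the decoded auxiliary information on the missing pixels.

```python
def ssd(a, b, mask=None):
    """Sum of squared differences over the positions selected by `mask`.

    All positions are used when `mask` is None.
    """
    total = 0.0
    for index, (x, y) in enumerate(zip(a, b)):
        if mask is None or mask[index]:
            total += (x - y) ** 2
    return total

def select_patch(target, known, auxiliary, candidates, weight=1.0):
    """Choose the candidate patch used to fill the holes of `target`.

    `known[i]` is True where the target pixel is already reconstructed.
    The cost combines the usual match on known pixels with a match between
    the candidate and the auxiliary information on the missing pixels
    (assumed form of the additional distance).
    """
    missing = [not k for k in known]

    def cost(candidate):
        return (ssd(candidate, target, known)
                + weight * ssd(candidate, auxiliary, missing))

    return min(candidates, key=cost)

target = [10, 10, 0, 0]                # last two pixels are holes
known = [True, True, False, False]
auxiliary = [0, 0, 50, 52]             # only meaningful on the holes
candidates = [[10, 10, 10, 10], [10, 10, 49, 50]]
print(select_patch(target, known, auxiliary, candidates))
```

Both candidates match the known pixels equally well; only the auxiliary term allows the decoder to prefer the second one, which is the role that the auxiliary information plays for disoccluded regions.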

III. Optimal partitioning of the navigation domain 

A. Constrained partitioning 

The new data representation proposed above raises an important question, namely the effective design of the navigation
segments. We show here how the partitioning can be optimized under rate and storage constraints. Recall first that the objective
is to represent the navigation domain as the union of $N_V$ navigation segments. Thus, we have $\mathcal{X} = \bigcup_{i=1}^{N_V} \mathcal{X}(Y_i)$, where $\mathcal{X}(Y_i)$
is the set of images reconstructed from the reference image $Y_i$ and the associated auxiliary information.

Let us first study a simple case, which will allow us to define the new concept of similarity between two frames. We
assume that $N_V$ is given and that the reference images $Y_i$ are already fixed. A naive way of defining a navigation segment
$\mathcal{X}(Y)$ consists in decomposing the ND based on the distance between cameras:

$$\forall i \in [1, N_V], \quad \mathcal{X}(Y_i) = \{Y_i\} \cup \{X \in \mathcal{X} \mid \forall j \neq i,\ \delta(X, Y_i) < \delta(X, Y_j)\}. \tag{3}$$

This definition leads to an equidistant reference image distribution over the navigation domain, as shown in Fig. 7(a). However,
this definition takes into account neither the scene nor the notion of innovation between the images. This is why we define the
geometrical similarity $\gamma$ between two images as (for the sake of conciseness, geometrical similarity will be replaced by similarity
in the rest of the paper):

$$\gamma : (X, X') \mapsto \gamma(X, X') = |S_X \cap S_{X'}|. \tag{4}$$

This similarity definition lays the foundations of a new kind of correlation between images, when two images share a set of 
identical pixels but also contain sets of independent pixels. In other words, instead of considering a model where the correlation
between two images is an error over all the pixels, as is classically done in image coding, we use a model where two
pixels of two different images are either equal or totally independent. This new kind of correlation between images is measured
by the similarity function of Eq. (4). This leads to a novel partitioning defined as:

$$\forall i \in [1, N_V], \quad \mathcal{X}(Y_i) = \{Y_i\} \cup \{X \in \mathcal{X} \mid \forall j \neq i,\ \gamma(X, Y_i) > \gamma(X, Y_j)\}. \tag{5}$$

It reflects the quantity of innovation between two images and leads to non-equidistant partitioning. Typically, the navigation
segments are smaller if the similarity varies quickly with the distance between cameras (Fig. 7(b)). To illustrate the fact that
similarity is not linearly dependent on the distance between cameras, we present a simple experiment in Fig. 6. For the Ballet
sequence [35], we build a navigation domain made of 100 equidistant viewpoints (in the sense of the camera parameters). For
two reference images (index 1 in Fig. 6(a) and 50 in Fig. 6(b)), we calculate their similarity with all the other frames of the
navigation domain. The similarity is expressed here between 0 and 1 and corresponds to a percentage of common pixels,$^3$
i.e., of pixels that are associated with the same voxels in the 3D scene. We can actually see that the evolution of the similarity
function is not linear with the view index nor with the distance, i.e., the non-linear interpolation (solid red lines) fits
the similarity function better than the linear one (dashed black line).
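
The difference between the distance-based rule of Eq. (3) and the similarity-based rule of Eq. (5) can be illustrated on a toy example. The following Python sketch assumes a hypothetical 1D scene in which view $v$ sees ten consecutive voxels and the visible window shifts faster beyond view 5; the visibility model and all names are assumptions made for the illustration, and ties (excluded by the strict inequalities) are simply given to the first reference.

```python
def partition(views, references, score, prefer_larger):
    """Assign each view to the reference image with the best score.

    With a distance the smallest score wins (Eq. (3) style); with a
    similarity the largest score wins (Eq. (5) style).
    """
    pick = max if prefer_larger else min
    segments = {ref: [] for ref in references}
    for view in views:
        best = pick(references, key=lambda ref: score(view, ref))
        segments[best].append(view)
    return segments

def visible(view):
    """Toy visibility model: a window of 10 voxels whose start depends on the view."""
    start = view if view < 5 else 5 + 4 * (view - 5)
    return set(range(start, start + 10))

def distance(x, y):
    """Distance between cameras, here the difference of view indices."""
    return abs(x - y)

def similarity(x, y):
    """Number of voxels visible in both views."""
    return len(visible(x) & visible(y))

views = list(range(10))
references = [2, 7]
print(partition(views, references, distance, prefer_larger=False))
print(partition(views, references, similarity, prefer_larger=True))
```

In this toy setting, view 5 is closer to reference 7 in terms of camera distance, but shares more voxels with reference 2, so the two rules produce different segments.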




[Fig. 6 panels, both plotted against the view index: (a) similarity with respect to view 1; (b) similarity with respect to view 50.]

Fig. 6. Similarity evolution (blue crosses) as a function of the view index of a navigation domain in which the images are equidistant. The black dashed line
corresponds to a linear interpolation between the extreme values, while the solid red curve is a non-linear interpolation, which fits the
similarity evolution better.



[Fig. 7 panels, each showing the reference images $Y_1, \dots, Y_5$ and their navigation segments: (a) distance-based; (b) similarity-based.]

Fig. 7. Visual difference between distance-based and similarity-based partitioning for a 1D navigation domain.



3 The similarity is normally defined as a number of voxels in the 3D space; however, for this test, we have chosen to divide it by the size of the image in
order to obtain a value between 0 and 1, which makes the interpretation easier.

Equipped with this new fundamental notion of similarity, let us further develop the concept of optimal partitioning. It is
obtained by fixing the right number of reference views $N_V$ and choosing the proper reference images $Y_i$. Partitioning is
optimized with respect to a storage size $\Gamma$, which corresponds to the total cost of storing all the navigation segments, and
with respect to a rate $R$, which corresponds to an average transmission cost. We assume that the navigation step $\Delta$ is fixed.
We further assume that if a user starts its navigation on a reference frame, the navigation segment is sufficiently big to
enable independent navigation during $N_T$ frames without transmission of another navigation segment. The optimal partitioning
problem can be posed as:

$$(N_V^*, \{Y_i^*\}) = \arg\min_{(N_V, \{Y_i\})} R(N_V, \{Y_i\}) \quad \text{under the constraint that} \quad \Gamma(N_V, \{Y_i\}) < \Gamma_{\max}. \tag{6}$$

We rewrite the above as an unconstrained problem with the help of a Lagrangian multiplier $\lambda$ as 

$$(N_V^*, \{Y_i^*\}) = \arg\min_{(N_V, \{Y_i\})} R(N_V, \{Y_i\}) + \lambda\, \Gamma(N_V, \{Y_i\}) \qquad (7)$$

where $\Gamma(N_V, \{Y_i\}) = \sum_{i=1}^{N_V} \left(|Y_i| + |\varphi_i|\right)$ and $\Gamma_{\max}$ is a storage constraint. The storage $\Gamma$ depends on both the size of the reference 
frame $|Y_i|$ and the size of the auxiliary information $|\varphi_i|$, with $\varphi = h(\Phi)$, where $h$ is the coding function for each navigation segment. The 
transmission rate $R$ corresponds to the expected size of the information to be sent after each request and measures the average 
size of the navigation segment. Note that, formally, it differs from the notion of bandwidth, which is expressed in bits per second. It 
depends on navigation models or view popularity and is written as: 

$$R(N_V, \{Y_i\}) = \sum_{i=1}^{N_V} P(X(Y_i))\, R(X(Y_i)) = \sum_{i=1}^{N_V} P(X(Y_i)) \left(|Y_i| + |\varphi_i|\right) \qquad (8)$$

where $P(X(Y_i)) = \int_{X \in X(Y_i)} P(X)$ corresponds to the probability that the user navigates in segment $X(Y_i)$, as proposed in 
Sec. [n| We assume that the relation between the coding rate and the auxiliary information has been characterized a priori, as 
it depends on the coding method. We propose a method to solve this optimization problem in the next section. 
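
As a small numerical illustration of the storage, rate and Lagrangian costs defined above, the following Python sketch evaluates them for a toy partitioning. It assumes that the amount of data sent for a segment is the size of its reference image plus its auxiliary information; the function names and all numerical values are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def storage_cost(ref_sizes: Sequence[float], aux_sizes: Sequence[float]) -> float:
    """Storage: total size of all segments, sum_i (|Y_i| + |phi_i|)."""
    return sum(y + p for y, p in zip(ref_sizes, aux_sizes))


def expected_rate(probs: Sequence[float],
                  ref_sizes: Sequence[float],
                  aux_sizes: Sequence[float]) -> float:
    """Rate: expected size sent per request, sum_i P_i * (|Y_i| + |phi_i|).

    Assumption: a request for segment i transfers |Y_i| + |phi_i|.
    """
    return sum(pr * (y + p) for pr, y, p in zip(probs, ref_sizes, aux_sizes))


def lagrangian_cost(probs: Sequence[float],
                    ref_sizes: Sequence[float],
                    aux_sizes: Sequence[float],
                    lam: float) -> float:
    """Unconstrained objective: rate + lambda * storage."""
    return (expected_rate(probs, ref_sizes, aux_sizes)
            + lam * storage_cost(ref_sizes, aux_sizes))


# Hypothetical partitioning with three segments (sizes in kbits).
ref_sizes = [100.0, 100.0, 100.0]   # |Y_i|
aux_sizes = [20.0, 40.0, 60.0]      # |phi_i|
probs = [0.5, 0.3, 0.2]             # P(X(Y_i)), sums to 1

gamma = storage_cost(ref_sizes, aux_sizes)
rate = expected_rate(probs, ref_sizes, aux_sizes)
cost = lagrangian_cost(probs, ref_sizes, aux_sizes, lam=0.1)
```

In this toy case the popular segment is also the smallest one, so the expected rate is lower than the plain average segment size; this is the effect that the popularity weights in the rate are meant to capture.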

B. Optimization method 

We first assume that $N_V$ is given. We notice that the optimization problem posed above is similar to a vector 
quantization problem [36]. Vector quantization consists in dividing the vector space into partitions, represented by codewords that are chosen to 
minimize the reconstruction distortion under a rate constraint. Here we want to find a partitioning of the navigation domain 
that minimizes the rate under storage constraints, while the quality of the reconstruction is not affected. In the Lloyd algorithm 
[36] for vector quantization, the positions of the codewords determine the quantization cells; similarly, the position of $Y_i$ 
determines $X(Y_i)$ in our partitioning problem. More precisely, in an ideal case, the definition of the navigation segment 
becomes $X(Y_i) = \arg\min_{X} (|Y| + |\varphi|)$. We consider that it can be achieved from Eq. (5), which builds the segment with 
the elements that have a higher similarity with the reference frame than with the reference frames of the other navigation 
segments. Intuitively, the similarity is a correlation measure between two frames, which makes it possible to form navigation segments 
when the reference frames are given. The problem now consists in selecting the reference frames $Y_i$. We consider a simple 
iterative algorithm that performs three steps: 

• step 1: initialize the reference frames $Y_i$ 

• step 2: derive optimal navigation segments given the reference frames $Y_i$, based on the frame similarity criterion in Eq. (5) 

• step 3: refine the reference frame in each navigation segment in order to minimize the storage and rate costs in Eq. (7). 

The algorithm then alternates between steps 2 and 3. It ends when the refinement in step 3 no longer 
provides a significant storage and rate gain. Convergence is guaranteed because the same objective $R + \lambda\Gamma$ is minimized 
in steps 2 and 3, and the objective function is bounded from below. 
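
The alternation between steps 2 and 3 can be sketched in Python as follows. This is only a toy illustration under stated assumptions, not the implementation used in the experiments: the segment cost $|Y| + |\varphi|$ is replaced by a hypothetical dissimilarity `dissim(i, j)` (the cost of rebuilding view `j` from reference `i`), which turns the procedure into a k-medoids-style iteration. The initial references are assumed to be distinct and `dissim(i, j)` is assumed to be positive for `i != j`, so that no segment becomes empty.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical cost of rebuilding view j from reference i (lower = more similar).
Dissim = Callable[[int, int], float]


def assign_segments(n_views: int, refs: Sequence[int], dissim: Dissim) -> List[List[int]]:
    """Step 2: attach each view to the reference it is most similar to."""
    segments: List[List[int]] = [[] for _ in refs]
    for view in range(n_views):
        nearest = min(range(len(refs)), key=lambda k: dissim(refs[k], view))
        segments[nearest].append(view)
    return segments


def segment_cost(ref: int, members: Sequence[int], dissim: Dissim) -> float:
    """Toy stand-in for the per-segment cost: total dissimilarity to the reference."""
    return sum(dissim(ref, view) for view in members)


def refine_references(segments: Sequence[Sequence[int]], dissim: Dissim) -> List[int]:
    """Step 3: in each segment, keep the member that minimises the segment cost."""
    return [min(seg, key=lambda r: segment_cost(r, seg, dissim)) for seg in segments]


def partition(n_views: int, init_refs: Sequence[int], dissim: Dissim,
              max_iter: int = 100) -> Tuple[List[int], List[List[int]], float]:
    refs = list(init_refs)                                # step 1: initialisation
    segments = assign_segments(n_views, refs, dissim)
    best_cost = float("inf")
    for _ in range(max_iter):
        segments = assign_segments(n_views, refs, dissim)  # step 2
        refs = refine_references(segments, dissim)         # step 3
        cost = sum(segment_cost(r, s, dissim) for r, s in zip(refs, segments))
        if cost >= best_cost:                              # no further gain: stop
            break
        best_cost = cost
    return refs, segments, best_cost


# Toy 1D domain: ten views whose (hypothetical) positions form two clusters.
positions = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]


def toy_dissim(i: int, j: int) -> float:
    return abs(positions[i] - positions[j])


refs, segments, cost = partition(len(positions), [0, 1], toy_dissim)
```

Both steps can only lower the same total cost, which is bounded below by zero, so the loop stops after a finite number of iterations; this mirrors the convergence argument given above. Like the Lloyd algorithm, the sketch may stop in a local minimum that depends on the initial references.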

It remains to define the number of segments, i.e., the value of $N_V$. For that purpose, we need to define a maximum number 
of navigation segments $M$. It corresponds to the case where all the segments have the minimum acceptable size, i.e., the size 
of the navigation ball defined in Eq. (2). We write $M$ as follows: 

$$M = \frac{\mathrm{size}(ND)}{\mathrm{size}(B(X, N_T \Delta))}.$$

The size of a set $C$ is defined as $\int_{x \in C} \mathbb{1}(x)\,dx$ (where $\mathbb{1}$ is the classical indicator function). It follows that $N_V$ lies between 1 
and $M$. We then determine $N_V$ as 

$$N_V = \arg\min_{1 \le N_V \le M} \; \Gamma + \mu R_{\max} \qquad (9)$$

where $R_{\max}$ is the maximum navigation segment size that the user receives per request during navigation. The parameter $\mu$ 
regulates the relative importance of the rate with respect to the storage cost. To solve Eq. (9), 
we calculate the storage and rate values for $N_V$ taking each integer value between 1 and $M$: 

• $\Gamma = N_V |Y| + N_V |\varphi|$, where $|Y|$ and $|\varphi|$ are rough estimates of the average reference frame rate and auxiliary 
information rate. They are deduced from the coding strategy adopted for $\varphi = h(\Phi)$. 
Finally, we obtain the optimal number of navigation segments by exhaustive search, as 

$$N_V^* = \arg\min_{1 \le N_V \le M} \; (N_V + \mu)|Y| + (N_V + \mu)|\varphi|. \qquad (10)$$
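
The exhaustive search over the number of segments can be sketched in a few lines of Python. The sketch assumes, as the criterion above suggests, that the maximum rate per request is $|Y| + |\varphi|$, and it lets both estimates depend on the number of segments through two caller-supplied functions. The function names and the toy size model (auxiliary information shrinking as 1/N when the domain is cut into N segments) are hypothetical and serve only to make the example runnable.

```python
from typing import Callable


def best_segment_count(m_max: int,
                       ref_size: Callable[[int], float],
                       aux_size: Callable[[int], float],
                       mu: float) -> int:
    """Return the N in [1, m_max] that minimises storage + mu * max rate.

    ref_size(n) and aux_size(n) are rough estimates of |Y| and |phi|
    when the navigation domain is split into n segments.
    """
    def criterion(n: int) -> float:
        segment = ref_size(n) + aux_size(n)   # estimated size of one segment
        storage = n * segment                 # n segments are stored
        max_rate = segment                    # assumption: one segment per request
        return storage + mu * max_rate

    return min(range(1, m_max + 1), key=criterion)


# Hypothetical size model (kbits): constant reference size, auxiliary
# information inversely proportional to the number of segments.
def toy_ref(n: int) -> float:
    return 100.0


def toy_aux(n: int) -> float:
    return 1200.0 / n


n_storage_oriented = best_segment_count(20, toy_ref, toy_aux, mu=3.0)
n_rate_oriented = best_segment_count(20, toy_ref, toy_aux, mu=12.0)
```

With this toy model, a larger weight on the rate term leads to a larger number of smaller segments, while a smaller weight favors fewer segments and therefore a lower storage cost.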

IV. Experiments 

A. Setup 

Our novel interactive system is tested on two well-known multiview sequences provided by Microsoft Research [35], namely 
Ballet and Breakdancer. Each of these sequences is composed of eight texture and depth videos and their associated intrinsic 
and extrinsic parameters. From these multiview images, we build a navigation domain that is composed of 120 viewpoints 
(texture and depth), as illustrated in Fig. 8. We also build a 2D navigation domain, which consists of 5 distinct rows, each a 
horizontal alignment of 120 viewpoints. In order to create the viewpoints that are not present in the original sequences, we 
use view synthesis techniques |T|. All of the images (camera images and synthetic images) form our input dataset; they are 
considered as original images, and can be chosen as reference frames by the partitioning algorithm. Finally, we index the images 
in this set of equidistant viewpoints from 1 to 120 for the 1D navigation domain, and from (1,1) to (5,120) for the 2D one. 




Fig. 8. Illustration of the 1D and 2D navigation domains used in the experiments for the Ballet sequence. 



B. Disocclusion filling based on coded auxiliary information 

We first study the performance of disocclusion filling based on auxiliary information. This validates the 
reconstruction strategy that is at the core of our new data representation method. An example of reconstruction with the 
proposed inpainting technique is shown in Fig. 9. It is obtained by first projecting a reference view $Y$ onto a virtual view $X$ 
(see Fig. 9(a); the disocclusions are in white). The disoccluded regions are then reconstructed using the classical Criminisi 
algorithm (Fig. 9(b)) and our guided inpainting method (Fig. 9(c)). We can see that the quality of the reconstruction obtained with our 
method is very satisfying. Moreover, the side information used for this illustrative example is not heavy in terms of bitrate, as 
shown in Fig. 10. In these experiments, we measure the rate (at different quantization steps) of the following schemes: a single 
view transmission, two views coded jointly, and our proposed representation (one view and the auxiliary information $\varphi$). We 
observe that the rate of our representation method is much smaller than the rate needed for sending two reference views for 
synthesis. 




[Figure 9 panels: (a) reference image projection; (b) disocclusion recovery without auxiliary information; (c) disocclusion recovery with DCT-based $\varphi$.]



Fig. 9. Visual results for the reconstruction of view 2 in Ballet using view 1 as reference image. We compare the classical inpainting method (b) and the proposed 
guided inpainting method (c). 



Fig. 10. Rate comparison between different representations for the navigation segment: single view, two views, one reference image + auxiliary information. 

In the next experiments, shown in Sections IV-C to IV-E, we have chosen to present the results in terms of the number of voxels 
in the segment innovation, $|\Phi|$, instead of the rate and storage costs. This choice is motivated by the following reasons. In 
Eq. (7), the rate and storage costs depend on $|\varphi|$, which is the size (in kbits) of the auxiliary information. This auxiliary 
information is a compressed version of the segment innovation $\Phi$. Intuitively, the quantity of information in $\varphi$ increases 
with the number of elements in the segment innovation $\Phi$. We can also observe that the increase is linear 
with the auxiliary information design presented above (see Fig. 11). By showing the results in terms of $|\Phi|$, we 
present general results that can be adapted to any kind of encoding function $h$. However, in order to also give 
a more concrete view of the performance of our solution, we show in Sec. IV-F some rate and storage results obtained with a practical 
implementation of the system (based on auxiliary information constructed using DCT coefficients, as introduced above). 

4 Since we are considering the navigation in a static image, we only consider the temporal instant 1. 




Fig. 11. Illustration of the evolution of the auxiliary information size $|\varphi|$ as a function of the number of voxels in the segment innovation $|\Phi|$, for the practical 
implementation of the DCT-based auxiliary information design ($q$ corresponds to the number of bits used to describe each DC coefficient). 



C. Influence of reference view 

We now study the influence of the position of a reference view within a navigation segment. One of the strengths of the 
proposed representation is that it avoids the differentiation between captured and synthesized views. Every frame is considered with 
the same importance, which gives a new degree of freedom in navigation performance optimization via proper selection of 
the reference view $Y_i$. We evaluate the impact of the position of $Y_i$ on the size of the segment innovation $\Phi$. We illustrate in 
Figs. 12 and 13 the typical evolution of $|\Phi|$ as a function of $Y_i$, in 1D and 2D navigation domains. More precisely, we fix the 
navigation segments and vary the position of $\{Y_i\}$. For each position, we calculate $|\Phi|$ as explained in Sec. II-D. We see that 
the shape of these curves is approximately convex, but irregular and asymmetric. The size of the auxiliary information 
clearly depends on the scene content. We see that the position of $Y_i$ has a strong impact on the size of the segment innovation, 
and therefore on the rate of the encoded auxiliary information, since $|\varphi|$ depends on $|\Phi|$. The size $|\Phi|$ can even 
double, depending on the position of $Y_i$. 



D. Optimal partitioning 

Fig. 12. Size of the segment innovation $\Phi$ (measured in number of voxels) as a function of the reference frame position (expressed in terms of camera index 
within the general navigation domain) for a 1D navigation domain. 

Fig. 13. Size of the segment innovation $\Phi$ (measured in number of voxels) as a function of the reference frame position (expressed in terms of camera index 
within the general navigation domain) for a 2D navigation domain. 

We now discuss the results of the optimized partitioning algorithm and its effect on the size of the segment innovation $\Phi$. 
We assume here that the number of partitions $N_V$ is predetermined. We run the algorithm with the datasets introduced in 
Sec. IV-A. Since the criterion function in our optimization problem is not completely convex, the algorithm must be 
initialized carefully in order to avoid local minima. We put the initial reference frames at equidistant 
positions (in terms of geometrical similarity), which has been shown experimentally to be a good initial solution. In Figs. 14 
and 15 we show the performance of our algorithm in the partitioning of a 1D navigation domain. In each of these figures 
we show (a) the evolution of the partitioning and (b) the evolution of the segment innovation $|\Phi|$ through the successive steps 
of the partitioning algorithm. We see that, for $N_V = 2$ and $N_V = 3$, the total segment innovation decreases and the size of 
each segment innovation $|\Phi|$ converges to a similar value. We also remark that the algorithm converges in a small number of 
steps. We show in Figs. 16 and 17 similar results for the partitioning of a 2D navigation domain. We illustrate the final 2D 
partitioning and the evolution of the segment innovation size along the steps of the algorithm. We can see that the algorithm 
converges quickly and manages to decrease the size of the segment innovation $\Phi$. It is interesting to notice that the 
resulting partitioning does not correspond to an equidistant distribution of the reference frames (in terms of camera parameter 
distance). The proposed algorithm takes into account the scene content in the definition of the navigation segments. 



E. Optimal number of navigation segments 

We now study how the system determines the appropriate number of navigation segments. The optimal number of navigation 
segments $N_V$ is determined by minimizing the criterion given in Eq. (10). We show in Fig. 18 the shape of this criterion 
function for different values of the relative weight factor $\mu$. For these tests, we have considered that the coding function 
$h$ is linear (i.e., $|\varphi|$ increases linearly with $|\Phi|$), as experimentally observed in Fig. 11. Then, for each value of $N_V$, we 
roughly estimate the storage and maximum rate costs for the Ballet sequence. We obtain a different optimal number 
of navigation segments $N_V$ depending on the parameter $\mu$ that trades off storage and rate costs. If $\mu$ is large (i.e., 
more importance is given to the rate cost), the algorithm selects a high number of navigation segments. On the contrary, if the 
storage cost has more importance, the system prefers a small number of navigation segments. The parameter $\mu$ thus regulates 
the relative importance of the rate cost with respect to the storage cost. This parameter is determined during the design of the system and 
depends on the system constraints and network delivery conditions. 
Fig. 14. Partitioning results for two 1D navigation segments when the initial reference frames are set at positions 40 and 80 (camera indexes). On the left, 
the evolution of the partitioning is illustrated as a function of the algorithm evolution expressed as (a, b), where a is the iteration and b is the step. On the right, 
the corresponding $|\Phi|$ evolution is plotted. 



F. Rate and storage performance 



So far, we have mainly presented partitioning results in terms of the size of the segment innovation $|\Phi|$, which is directly 
related to the rate and storage costs. We now present results that illustrate the performance of our algorithm in terms of rate 
values. We encode the auxiliary information with a quantized DCT representation as introduced in Sec. II-D, which leads to 
a linear relation between the rate and the size $|\Phi|$ (as illustrated in Fig. 11). We first model a possible navigation path for a 
user navigating for 100 s. The obtained path is represented in Fig. 19(a). For this navigation path, we simulate the 
communication of the client with the server and we plot the evolution of the bit rate at the client in Fig. 19(b), with the initial 
partitioning and with the one optimized by our algorithm. Here, the initial partitioning corresponds to the regular distribution 
of reference frames at the initialization of the optimization algorithm. We further plot the cumulative rate of the navigation 
instance as a function of time for the same two partitioning solutions. We see that the rate per second significantly decreases 
with the optimal partitioning; similarly, the cumulative rate after 100 s of navigation is also smaller when the partitioning is 
optimal. Similar results have been obtained for different navigation paths and different values of the number of navigation 
segments $N_V$. To generalize these results, we have averaged the cumulative rate after 100 s over 100 navigation paths, and for 
different values of $N_T$. We show the results in Fig. 20. We can see from all these representative results that the partitioning 
optimization leads to significant rate cost reductions. This validates our partitioning optimization solution. 
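
To make the client-server simulation more concrete, the following Python sketch accumulates the rate along a navigation path. It is a simplified model under explicit assumptions and not the simulator used for Figs. 19 and 20: segments are contiguous, non-overlapping intervals of a 1D domain, the client keeps only the segment it is currently in, and a segment is downloaded in full each time the user enters it. The function name, the segment sizes and the path are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (first_view, last_view, size_in_kbits) for each navigation segment.
Segment = Tuple[int, int, float]


def simulate_requests(path: Sequence[int], segments: Sequence[Segment]) -> List[float]:
    """Return the cumulative rate (kbits) after each step of the navigation path.

    Every view of the path must belong to one segment; otherwise the
    lookup below raises StopIteration.
    """
    current: Optional[int] = None      # index of the segment held by the client
    total = 0.0
    cumulative: List[float] = []
    for view in path:
        inside = (current is not None
                  and segments[current][0] <= view <= segments[current][1])
        if not inside:
            # The user left the current segment: request the one containing `view`.
            current = next(k for k, (lo, hi, _) in enumerate(segments) if lo <= view <= hi)
            total += segments[current][2]
        cumulative.append(total)
    return cumulative


# Hypothetical partitioning of a 120-view domain into two segments.
toy_segments = [(1, 60, 300.0), (61, 120, 500.0)]
toy_path = [58, 59, 60, 61, 62, 61, 60, 59]
rates = simulate_requests(toy_path, toy_segments)
```

Running the same path against two different partitionings, for example the regular initialization and an optimized one, gives two cumulative curves that can be compared in the same way as in the experiment described above. A path that oscillates around a segment boundary, as in this toy example, is the costly case for this simple client model.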

Finally, in order to assess the efficiency of the proposed representation method, we compare the storage cost of the 
proposed system with a naive solution. The latter consists in jointly compressing the 8 captured images (with JMVM [37]), 
and in interpolating the other frames with bidirectional DIBR, as is classically done in the literature. We use two different 
partitioning solutions ($N_V = 2$ and $N_V = 3$) and we use the DCT-based auxiliary information coding explained in Sec. II-D. 
The storage cost accounts for the reference images (color and depth) and, for our solution, the auxiliary information. 
The quality is estimated over a representative sample of images (8,23,38,53,68,83,98,113 and the 8 
reference views). The results are shown in Fig. 21, where we see that the proposed representation obtains compression 
performance similar to that of the 8 views in JMVM without auxiliary information. However, since the use of JMVM compression is not 
realistic in an interactive scenario, the proposed representation shows promising performance even though the 
compression of the auxiliary information is not fully optimized yet. 
Fig. 15. Partitioning results for three 1D navigation segments when the initial reference frames are set at positions 10, 60 and 100 (camera indexes). On 
the left, the evolution of the partitioning is illustrated as a function of the algorithm evolution expressed as (a, b), where a is the iteration and b is the step. 
On the right, the corresponding $|\Phi|$ evolution is plotted. 




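[Figure 16 panels: (a) final partitioning, plotted against the horizontal camera index; (b) innovation size for Segment 1, Segment 2 and Total, plotted against the algorithm evolution.]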



Fig. 16. 2D partitioning results for two navigation segments with initial reference frames at positions (3, 30) and (3, 60) (camera indexes). On the left, the 
final partitioning is illustrated; on the right, the evolution of the innovation size $|\Phi|$ is plotted as a function of the algorithm evolution expressed as (a, b), 
where a is the iteration and b is the step. 



V. Conclusion 

In this paper, we propose a novel data representation method for interactive multiview imaging. It is based on the notion of 
a navigation domain, which is optimally split into several navigation segments. Each of these navigation segments is described 
with one reference image and auxiliary information, which enables high-quality user navigation at the receiver. In addition to 
this novel concept, we have proposed an effective solution to partition the navigation domain and find the best position for the reference 
images. Experimental results show that the viewing experience of the user is significantly improved with a reasonable rate and 
storage cost. The comparison with existing standards is encouraging and shows the potential of such an approach for future 
systems. Future work will mainly focus on the extension of this idea to temporal aspects in order to enable efficient interactive 
Fig. 17. 2D partitioning results for three navigation segments with initial reference frames at positions (3, 10), (3, 60) and (3, 100) (camera indexes). On 
the left, the final partitioning is illustrated; on the right, the evolution of the innovation size $|\Phi|$ is plotted as a function of the algorithm evolution expressed 
as (a, b), where a is the iteration and b is the step. 

Fig. 18. Optimal number of navigation segments $N_V$ for different values of the relative weight factor $\mu$. 